I chose the white wine dataset for exploratory data analysis in R because of my affinity for chemistry. The primary question is which chemical properties affect the quality of white wine.
After loading the dataset, this is what its structure, summary statistics and column names look like.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
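The output above is what `str()`, `summary()` and `names()` print for the loaded data frame. A minimal sketch of that loading step follows. The file name `wineQualityWhites.csv` and the temporary path are assumptions, since the report does not say where the CSV lives, and the three stand-in rows are simply the first three observations shown in the `str()` output so that the example runs on its own.

```r
# Hypothetical file name and location; the report does not state them.
csv_path <- file.path(tempdir(), "wineQualityWhites.csv")

# Stand-in rows: the first three observations as printed by str() above.
sample_rows <- data.frame(
  X = 1:3,
  fixed.acidity = c(7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  citric.acid = c(0.36, 0.34, 0.40),
  residual.sugar = c(20.7, 1.6, 6.9),
  chlorides = c(0.045, 0.049, 0.050),
  free.sulfur.dioxide = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97),
  density = c(1.001, 0.994, 0.995),
  pH = c(3.00, 3.30, 3.26),
  sulphates = c(0.45, 0.49, 0.44),
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)
write.csv(sample_rows, csv_path, row.names = FALSE)

# Load and inspect, mirroring the three outputs shown in the report.
wine_data <- read.csv(csv_path)
str(wine_data)             # column types and first values
print(summary(wine_data))  # min, quartiles, mean and max per column
print(names(wine_data))    # column names
```

With the real file, only `csv_path` would need to change.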
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Quality is a discrete variable that ranges from 3 to 9, with a median of 6 and a mean of 5.878. Its distribution appears to be roughly normal.
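The summary figures for quality can be cross-checked from the per-score counts that the report prints in its frequency table (20, 163, 1457, 2198, 880, 175 and 5 wines for scores 3 to 9). The sketch below rebuilds the quality column from those counts. The variable names are illustrative only.

```r
# Counts per quality score, copied from the frequency table in this report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rebuild the quality column from the counts (row order does not matter here).
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Same six-number summary that the report prints for quality.
print(summary(quality))

# Percentage of wines at each score, to judge how bell-shaped the counts are.
share <- round(100 * prop.table(table(quality)), 1)
print(share)
```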
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Both fixed and volatile acidity have long positive tails, which makes the mean higher than the median. Fixed acidity ranges from 3.8 to 14.2 g/dm^3, with a mean of 6.855 g/dm^3 and a median of 6.8 g/dm^3. Excluding the outliers above 10 g/dm^3, fixed acidity shows a normal distribution in the range of 5 to 10 g/dm^3. Volatile acidity ranges from 0.08 to 1.10 g/dm^3, with a mean of 0.278 g/dm^3 and a median of 0.260 g/dm^3. Excluding the outliers above 0.9 g/dm^3, the volatile acidity distribution is slightly bimodal. In general, volatile acidity is much lower than fixed acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid has a long positive tail. In the range of 0 to 0.8 g/dm^3, the distribution appears to be normal. There are a few points above 0.8 g/dm^3 that can be considered outliers. Some of the wines contain no citric acid at all.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The residual sugar distribution is positively skewed (long tail to the right), with a mean of 6.391 g/l and a median of 5.200 g/l. Many wines have a sugar level in the range of 1-2 g/l. There are a few outliers above 30 g/l.
The x axis was log transformed as the data was skewed to the right. It was interesting to observe a bimodal distribution with a group of sweeter white wines and less sweet white wines.
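The effect of the log transformation can be sketched on synthetic data. The vector below is not the wine data: it is a deterministic, made-up mixture of two log-normal clusters centred near 2 and 9 g/l, chosen only to mimic a right-skewed, two-group shape. Base R's `hist()` is used with `plot = FALSE` so that the bin counts can be inspected directly. The report's own plotting code is not shown, so this is just one way to do it.

```r
# Synthetic, deterministic stand-in for residual sugar: two log-normal
# clusters centred near 2 and 9 g/l. These are NOT the wine data.
sugar <- c(qlnorm(ppoints(500), meanlog = log(2), sdlog = 0.3),
           qlnorm(ppoints(500), meanlog = log(9), sdlog = 0.3))

# On the raw scale the long right tail pulls the mean above the median.
print(c(mean = mean(sugar), median = median(sugar)))

# Bin on the log10 scale; plot = FALSE returns the counts without drawing.
h <- hist(log10(sugar), breaks = seq(-0.2, 1.5, by = 0.1), plot = FALSE)
print(data.frame(bin_mid = h$mids, count = h$counts))

# Compare the bins near log10(2), log10(9) and the valley in between.
low_peak  <- h$counts[which.min(abs(h$mids - 0.35))]
valley    <- h$counts[which.min(abs(h$mids - 0.65))]
high_peak <- h$counts[which.min(abs(h$mids - 0.95))]
print(c(low_peak = low_peak, valley = valley, high_peak = high_peak))
```

On the log scale the two clusters show up as separate peaks with a dip between them, which is the pattern described for the sweeter and less sweet wines.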
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The distribution of chlorides is also positively skewed. After removing the top 1% of the values, it still shows a long tail distribution with the majority of the values from 0.01 to 0.10 g/ dm^3. It has a mean of 0.04577 g/dm^3 and a median of 0.043 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The distributions of total and free sulfur dioxide are both positively skewed.
After removing the top 1% of the values, total sulfur dioxide shows a normal distribution, with a mean of 138.4 ppm and a median of 134 ppm. After the same trimming, the free sulfur dioxide distribution has spikes up and down, although its overall shape is still a bell curve. It has a mean of 35.31 ppm and a median of 34 ppm. As expected, the free sulfur dioxide levels are lower than the total sulfur dioxide levels.
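Removing the top 1% of values amounts to keeping everything at or below the 99th percentile. A minimal sketch follows. The helper name `trim_top` and the example vector are made up for illustration, and the report may equally have limited the plot axis instead of filtering the data.

```r
# Hypothetical helper (not from the report): keep values at or below a quantile.
trim_top <- function(x, prob = 0.99) {
  cutoff <- quantile(x, probs = prob, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Made-up vector with one extreme value, to show the effect of trimming.
so2 <- c(1:99, 1000)
trimmed <- trim_top(so2)

print(c(max_before = max(so2), max_after = max(trimmed)))
print(c(mean_before = mean(so2), mean_after = mean(trimmed)))
```

The single extreme value is dropped, and the mean moves back towards the bulk of the data.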
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The density distribution is positively skewed. It has a mean of 0.994 and a median of 0.9937.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The pH ranges from 2.72 to 3.82 and follows an almost normal distribution, with every wine on the acidic side. The mean and median are both about 3.2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The alcohol content ranges from 8% to 14.2%. The distribution is irregular, with a peak at about 9.5%. The mean and median are 10.51% and 10.40% respectively. No wine in the dataset has an alcohol content below 8%.
The quality of the wines is converted to a factor variable and grouped into 3 categories: poor, neutral and good. Ratings of 3 and 4 are grouped as ‘poor’, ratings of 5 and 6 as ‘neutral’, and ratings of 7, 8 and 9 as ‘good’.
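One way to build this grouping is with `cut()`, sketched below. The variable name `quality_group` matches the name that appears in the report's output, but the exact call used by the author is not shown, so the breaks here are an assumption that reproduces the stated 3-4, 5-6 and 7-9 bands. The quality vector is rebuilt from the per-score counts printed in the report.

```r
# Counts per quality score, copied from the frequency table in this report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Intervals are (2,4], (4,6] and (6,9], i.e. scores 3-4, 5-6 and 7-9.
quality_group <- cut(quality,
                     breaks = c(2, 4, 6, 9),
                     labels = c("poor", "neutral", "good"))

# Number of wines in each group.
print(table(quality_group))
```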
There are 4898 observations and 13 columns. The X (id) and quality are integer values and the rest of the columns are numeric values.
The main area of interest is the wine quality rating given by the wine tasting experts. It will be interesting to see which chemicals in the white wine contribute to a high quality white wine.
Other features that may affect the taste and wine quality could be fixed acidity, citric acid, residual sugar and alcohol content. This will be explored further in bivariate and multivariate analysis.
I created a new factor variable for the quality of the wines to categorise into three categories of poor, neutral and good. With fewer categories, it will be easier to compare the trends of increase in quality vs other input variables.
I applied a log transformation to the x axis of the residual sugar histogram because the data was skewed to the right. After the transformation, the distribution turned out to be bimodal, with two peaks at about 2 g/l and about 9 g/l.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25581431 0.002857966
## fixed.acidity -0.255814305 1.00000000 -0.022697290
## volatile.acidity 0.002857966 -0.02269729 1.000000000
## citric.acid -0.149899918 0.28918070 -0.149471811
## residual.sugar 0.006623775 0.08902070 0.064286060
## chlorides -0.045645192 0.02308564 0.070511571
## free.sulfur.dioxide -0.011928911 -0.04939586 -0.097011939
## total.sulfur.dioxide -0.161979037 0.09106976 0.089260504
## density -0.185976097 0.26533101 0.027113845
## pH -0.115774132 -0.42585829 -0.031915368
## sulphates 0.009807759 -0.01714299 -0.035728147
## alcohol 0.213656245 -0.12088112 0.067717943
## quality 0.035763247 -0.11366283 -0.194722969
## citric.acid residual.sugar chlorides
## X -0.149899918 0.006623775 -0.04564519
## fixed.acidity 0.289180698 0.089020701 0.02308564
## volatile.acidity -0.149471811 0.064286060 0.07051157
## citric.acid 1.000000000 0.094211624 0.11436445
## residual.sugar 0.094211624 1.000000000 0.08868454
## chlorides 0.114364448 0.088684536 1.00000000
## free.sulfur.dioxide 0.094077221 0.299098354 0.10139235
## total.sulfur.dioxide 0.121130798 0.401439311 0.19891030
## density 0.149502571 0.838966455 0.25721132
## pH -0.163748211 -0.194133454 -0.09043946
## sulphates 0.062330940 -0.026664366 0.01676288
## alcohol -0.075728730 -0.450631222 -0.36018871
## quality -0.009209091 -0.097576829 -0.20993441
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.0119289106 -0.161979037 -0.18597610
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## volatile.acidity -0.0970119393 0.089260504 0.02711385
## citric.acid 0.0940772210 0.121130798 0.14950257
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## sulphates 0.0592172458 0.134562367 0.07449315
## alcohol -0.2501039415 -0.448892102 -0.78013762
## quality 0.0081580671 -0.174737218 -0.30712331
## pH sulphates alcohol quality
## X -0.1157741316 0.009807759 0.21365624 0.035763247
## fixed.acidity -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity -0.0319153683 -0.035728147 0.06771794 -0.194722969
## citric.acid -0.1637482114 0.062330940 -0.07572873 -0.009209091
## residual.sugar -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides -0.0904394560 0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide -0.0006177961 0.059217246 -0.25010394 0.008158067
## total.sulfur.dioxide 0.0023209718 0.134562367 -0.44889210 -0.174737218
## density -0.0935914935 0.074493149 -0.78013762 -0.307123313
## pH 1.0000000000 0.155951497 0.12143210 0.099427246
## sulphates 0.1559514973 1.000000000 -0.01743277 0.053677877
## alcohol 0.1214320987 -0.017432772 1.00000000 0.435574715
## quality 0.0994272457 0.053677877 0.43557472 1.000000000
The quality of the wine has a higher correlation with alcohol content than with any other variable. However, let’s also look at white wine quality against other factors, such as density, pH, citric acid and the sulfur dioxides.
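To make the comparison explicit, the quality column of the correlation matrix can be sorted by absolute value. The sketch below uses the values printed in the matrix above, rounded to four decimals, with the row id `X` left out. With the full data frame loaded, the same vector would come from something like `cor(wine_data)[, "quality"]`, where `wine_data` is the data frame name seen in the report's output.

```r
# Correlation of each input variable with quality, rounded from the
# matrix printed in this report (the row id X is omitted).
quality_cor <- c(
  fixed.acidity = -0.1137, volatile.acidity = -0.1947,
  citric.acid = -0.0092, residual.sugar = -0.0976,
  chlorides = -0.2099, free.sulfur.dioxide = 0.0082,
  total.sulfur.dioxide = -0.1747, density = -0.3071,
  pH = 0.0994, sulphates = 0.0537, alcohol = 0.4356
)

# Rank by strength of association, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(ranked)
```

Alcohol comes first and density second, followed by chlorides and volatile acidity.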
##
## poor neutral good
## 183 3655 1060
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
I converted quality from an integer to a factor variable so that boxplots could be used, which makes it easier to see the median of the variable of interest. The median alcohol level decreases as quality increases from 3 to 5, and then increases across the quality levels from 5 to 9. At every quality level there is a large variance in alcohol content, except at quality 9, where the variance is the lowest and the median alcohol content is the highest.
In the plot of alcohol content against the quality_group variable, the median alcohol content of the ‘good’ wines is distinctly higher than that of the ‘poor’ and ‘neutral’ wines.
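The factor conversion and the per-level medians can be sketched with base R's `boxplot()`. The toy data frame below is made up purely to show the mechanics and does not come from the wine data. `plot = FALSE` returns the box statistics without drawing, and row 3 of `stats` holds the medians.

```r
# Made-up toy values, only to show the mechanics (not the wine data).
toy <- data.frame(
  quality = factor(c(5, 5, 5, 6, 6, 6, 7, 7, 7)),
  alcohol = c(9.2, 9.5, 9.8, 10.1, 10.4, 10.9, 11.2, 11.6, 12.3)
)

# One box per factor level; plot = FALSE returns the statistics only.
b <- boxplot(alcohol ~ quality, data = toy, plot = FALSE)

# Rows of b$stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
medians <- setNames(b$stats[3, ], b$names)
print(medians)
```

Dropping `plot = FALSE` draws the boxplot itself.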
## wine_data$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.588 4.600 6.392 10.700 16.200
## --------------------------------------------------------
## wine_data$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## wine_data$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## wine_data$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## wine_data$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## wine_data$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## wine_data$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
The median residual sugar goes up and down across the quality levels, and there is a large variance in residual sugar at each level. From quality 5 to 9, the median generally decreases (7.0, 5.3, 3.65, 4.3 and 2.2 g/l), with a small rise at quality 8. It is also interesting that both the variance in residual sugar and the median (2.20 g/l) are lowest when the quality is the highest (9).
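The grouped summaries above have the layout that `by()` produces when `summary` is applied per group. A minimal sketch on made-up numbers (not the wine data) follows, with `tapply()` used to pull out just the median and the interquartile range per quality level.

```r
# Made-up toy values, only to show the mechanics (not the wine data).
toy <- data.frame(
  quality = c(5, 5, 5, 5, 9, 9, 9),
  residual.sugar = c(1.8, 6.0, 8.0, 11.5, 2.0, 2.2, 4.2)
)

# Six-number summary per quality level, in the same layout as the report.
print(by(toy$residual.sugar, toy$quality, summary))

# Median and spread per level as plain named vectors.
med <- tapply(toy$residual.sugar, toy$quality, median)
iqr <- tapply(toy$residual.sugar, toy$quality, IQR)
print(rbind(median = med, IQR = iqr))
```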
With the quality group plot vs residual sugar content, the median of the good wines is in between the ‘poor’ and ‘neutral’ wines. The median is 6.200.
The median density of the good wines is the lowest, at 0.9917 g/cm3.
Excluding the outliers, the median pH is between 3.1 and 3.3. I doubt that wine tasting experts could tell apart pH differences of 0.1. For a quality of 9, the variance observed is very small. There are a lot of pH outliers at quality levels 5 to 7.
## wine_data$quality_group: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.400 6.900 7.181 7.650 11.800
## --------------------------------------------------------
## wine_data$quality_group: neutral
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.876 7.400 14.200
## --------------------------------------------------------
## wine_data$quality_group: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.900 6.200 6.700 6.725 7.200 9.200
## wine_data$quality_group: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.110 0.260 0.320 0.376 0.460 1.100
## --------------------------------------------------------
## wine_data$quality_group: neutral
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2771 0.3200 0.9650
## --------------------------------------------------------
## wine_data$quality_group: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2653 0.3200 0.7600
The median fixed acidity is nearly the same across the quality groups, decreasing only slightly from 6.9 g/dm3 (poor) to 6.8 g/dm3 (neutral) and 6.7 g/dm3 (good).
The median volatile acidity is the lowest in the good quality wines. This is consistent with the dataset description, which says that higher levels of volatile acidity can lead to an unpleasant, vinegar-like taste.
## wine_data$quality_group: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01300 0.03750 0.04600 0.05056 0.05400 0.29000
## --------------------------------------------------------
## wine_data$quality_group: neutral
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03700 0.04400 0.04774 0.05100 0.34600
## --------------------------------------------------------
## wine_data$quality_group: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03816 0.04400 0.13500
The good quality wines have the lowest median chloride amount, compared to the poor and neutral quality wines. The median chloride amounts are not distinctly apart from each other. A lot of outliers are noticed in the chloride amounts of good quality wines.
## wine_data$quality_group: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 85.5 119.0 130.2 177.0 440.0
## --------------------------------------------------------
## wine_data$quality_group: neutral
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 111.0 140.0 142.6 173.0 344.0
## --------------------------------------------------------
## wine_data$quality_group: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.0 101.0 122.0 125.2 146.0 229.0
## wine_data$quality_group: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 18.00 26.63 33.50 289.00
## --------------------------------------------------------
## wine_data$quality_group: neutral
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.96 47.00 131.00
## --------------------------------------------------------
## wine_data$quality_group: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 33.00 34.55 42.00 108.00
## wine_data$quality_group: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.250 0.380 0.470 0.476 0.540 0.870
## --------------------------------------------------------
## wine_data$quality_group: neutral
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.4100 0.4700 0.4876 0.5400 1.0600
## --------------------------------------------------------
## wine_data$quality_group: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4000 0.4800 0.5001 0.5800 1.0800
The median total sulfur dioxide level in good wines is 122 mg/dm3, which is in between the poor and the neutral quality groups. However, the variance in the good quality wines is lower than in the poor and neutral groups. The white wines data sheet mentions that SO2 concentrations above 50 ppm become evident in the smell and taste of the wine. This is perhaps consistent with there being many more outliers (above 200 ppm) in the poor and neutral quality wines than in the good wines.
The median free sulfur dioxide is about the same in good and neutral wines (33 and 34 mg/dm3), and higher than in the poor quality wines (18 mg/dm3). The variance in the good quality wines is lower.
The median sulphates level is nearly the same across the quality groups (0.47, 0.47 and 0.48 g/dm3).
## [1] 0.8389665
As the residual sugar increases, the density increases. Less variance in density is observed as the residual sugar increases. Perhaps another factor also influences density. This would be consistent with the data sheet, which says that alcohol content also affects density.
## [1] -0.7801376
As the alcohol content increases, the density decreases. The two have a strong negative correlation of -0.780.
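The single number printed above is a Pearson correlation coefficient, presumably from `cor()` or `cor.test()` (the report does not show the call). As a sketch of what that number measures, the block below computes it by hand and compares it with `cor()`. The inputs are only the first five alcohol and density values shown in the `str()` output, so the result illustrates the calculation and the sign, not the full-data value. `pearson_r` is a hypothetical helper name.

```r
# First five observations as printed (and rounded) in the str() output.
density <- c(1.001, 0.994, 0.995, 0.996, 0.996)
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9)

# Hypothetical helper: Pearson's r from deviations about the means.
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(alcohol, density)
r_builtin <- cor(alcohol, density)  # method = "pearson" is the default
print(c(manual = r_manual, builtin = r_builtin))
```

Even on these five rows the coefficient is negative: the wines with more alcohol are the less dense ones.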
## [1] 0.615501
As total sulfur dioxide increases, the free sulfur dioxide increases as well. The total sulfur dioxide is the amount of free and bound form of SO2 in the wine. It is not surprising that it has a high correlation. As total sulfur dioxide increases, the variance in free sulfur dioxide increases as well.
## [1] 0.5298813
The density increases as the total sulfur dioxide increases. It is noted that increase in density is related to a decrease in wine quality.
The feature of interest in this case was the quality of white wine. Quality has its highest positive correlation with alcohol content (0.436), and a clear negative correlation with density (-0.307).
It was interesting that residual sugar and density had a high positive correlation of 0.84, while alcohol and density had a negative correlation of -0.78. Since the fermentation process produces alcohol from the sugars, the more alcohol is produced, the less residual sugar remains.
Free sulfur dioxide and total sulfur dioxide also have a high positive correlation of 0.616. This was expected, as free sulfur dioxide is a subset of total sulfur dioxide. A general intuitive relationship between acidity and pH is that lower pH values correspond to higher acidity.
The strongest relationship was between the residual sugar and density with a correlation of 0.839.
Taking the residual sugar as constant, the good quality wines have a lower density. It is also noticed that the good quality wines are more on the left side of the residual.sugar vs density plot. This is due to the increased alcohol content with the lower residual sugar.
Keeping total sulfur dioxide constant, higher quality wines are noticed with larger amounts of free sulfur dioxide and the poor quality wines are noticed with lower amounts of free sulfur dioxide.
Alcohol content from 12% to 14% is concentrated where the amount of chlorides is within 0.05 g/dm3. As fixed acidity increases, the pH reduces. All quality of wines seem to share a similar trend. It doesn’t seem to influence the quality of wines greatly.
When I started with the univariate plots, I thought the quality of white wines was influenced by alcohol content alone. However, it was interesting to discover how density and residual sugar were connected with alcohol content, and how they all played a part in influencing the quality of the wines.
Chlorides and sulfur dioxides didn’t have much impact on quality when analysing the bivariate plots. However, they had an effect on the alcohol content, which in turn affected quality.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wine_data)
## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = wine_data)
## m3: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar,
## data = wine_data)
## m4: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar +
## volatile.acidity, data = wine_data)
## m5: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar +
## volatile.acidity + chlorides, data = wine_data)
##
## ================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------
## (Intercept) 0.582 -24.492 88.313 72.225 71.271
##
## I(alcohol) 0.313 0.360 0.246 0.286 0.283
##
## density 24.728 -87.886 -71.546 -70.514
##
## residual.sugar 0.053 0.052 0.052
##
## volatile.acidity -2.059 -2.044
##
## chlorides -0.692
##
## ----------------------------------------------------------------
## Log-likelihood Inf Inf Inf Inf Inf
## Deviance 0.0 0.0 0.0 0.0 0.0
## AIC -Inf -Inf -Inf -Inf -Inf
## BIC -Inf -Inf -Inf -Inf -Inf
## N 4898 4898 4898 4898 4898
## ================================================================
I tried building a linear model of quality against alcohol. I added other terms that might have an effect on the quality, like density, residual.sugar, volatile acidity and chlorides.
The linear model was chosen as a simple starting point. However, the R-squared value is only about 0.3, which means that only about 30% of the variance in quality is explained by the independent variables. Another limitation is that the response variable, wine quality, is an ordinal score rather than a continuous measurement. Because it is based on sensory ratings, it will also differ from person to person.
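The way R-squared changes as terms are added can be sketched with nested `lm()` fits. The data below is simulated, and the coefficients used to generate it are made up, so the numbers say nothing about the wine data. The point is only how `summary()$r.squared` is read off each model.

```r
# Simulated stand-in data; the generating coefficients are made up.
set.seed(42)
n <- 200
toy <- data.frame(
  alcohol = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.0)
)
toy$quality <- 2.5 + 0.3 * toy$alcohol -
  2 * toy$volatile.acidity + rnorm(n, sd = 0.8)

# Nested models: m2 adds one predictor to m1.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- lm(quality ~ alcohol + volatile.acidity, data = toy)

# Share of variance in the response explained by each model.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 3))
```

For nested least-squares models, R-squared never decreases when a predictor is added, so adjusted R-squared (`summary(m2)$adj.r.squared`) is the fairer figure when comparing models of different size.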
This shows that the alcohol content of good wines is distinctly above that of the poor and neutral quality white wines. Poor and neutral quality wines have almost the same median value. One possible reason is the imbalance between the groups: only about 4% of the wines (183 of 4898) were grouped as poor quality, while about 75% were grouped as neutral.
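The group shares can be checked directly from the group counts printed earlier in the report (183 poor, 3655 neutral and 1060 good wines). A short sketch, with illustrative variable names:

```r
# Group sizes, copied from the quality-group table in this report.
group_counts <- c(poor = 183, neutral = 3655, good = 1060)

# Percentage of all wines in each group, rounded to one decimal place.
group_share <- round(100 * group_counts / sum(group_counts), 1)
print(group_share)
```

This gives 3.7% poor, 74.6% neutral and 21.6% good, so the poor group is by far the smallest.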
Density has the strongest negative correlation with the quality of white wines. The density at wine quality 3 and 4 is at about the same level as the density at quality 6. From a quality rating of 5 onwards, there is a clear downward trend in density as the wine quality increases.
The third plot shows the interaction of density and residual sugar, and the distribution of the white wine quality rating. It can be seen that good quality wines have a lower density for the same amount of residual sugar. It also shows that grouping quality into fewer categories made the results easier to see.
Initially, I did not group the quality of white wines into buckets. When I came to the bivariate analysis, I checked the median quality of wines across the other input variables. It was difficult to make comparisons, as there were seven levels of quality from 3 to 9 and no clear trends could be established. So I categorised the quality into buckets, reducing 7 levels to 3, and carried out the analysis again. With this change, it was easier to make comparisons across the quality levels.
When I started with the univariate plots, going through each variable in the dataset was time-consuming, and I was struggling with how to do the bivariate plots for every possible combination of variables. The bivariate plots helped me to focus on the pairs of variables that had a very high positive or negative correlation and to ignore the pairs with almost zero correlation.
Once a variable affecting the quality of wines (in this case, alcohol content) was identified using the bivariate plots, I tracked the other variables affecting that variable. This led to exploring how these variables interact with one another to affect the quality of wines. This seems to have steered me in the right direction, and I could complete the rest of the analysis successfully.
The linear regression attempt shows that only about 30% of the variance in quality is explained by all of the independent variables together. This opens up more avenues to explore, such as the weight of each independent variable in determining the quality of wines. It will also be interesting to see whether the same holds for the red wines data.